8  Tidy Data

8.1 Introduction

Working effectively with data in R is greatly simplified by the tidyverse, a collection of packages designed for data science. The tidyverse provides a consistent framework for data manipulation, visualization, and modeling, which helps students learn generalizable skills rather than package-specific tricks.

A central concept in the tidyverse is the tibble, a modern re-imagining of the traditional R data frame. Tibbles keep the familiar two-dimensional table structure but introduce improvements such as preserving variable types, supporting list columns, and displaying data more cleanly in the console. These features make them easier to use in practice, especially with large datasets.

Finally, the idea of tidy data lies at the heart of the tidyverse. According to Hadley Wickham’s definition, tidy data means each variable forms a column, each observation forms a row, and each type of observational unit forms a table. Tidy data creates a standardized structure that enables smooth use of functions across the tidyverse, reducing the need for ad hoc data reshaping and making analyses more transparent and reproducible.

8.2 Tibbles

  • What is a tibble?
    • Tibbles are data frames
    • But they try to enhance the regular “old” data frame from base R
  • To learn more
    • vignette("tibble")

Tibbles are data frames that enhance the regular “old” data frame from base R. They keep the same two-dimensional tabular structure but are designed to be more consistent, predictable, and user-friendly.
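As a quick sketch of this relationship, a tibble is still a data frame underneath; it simply carries two extra classes that trigger the improved behaviour:

```r
library(tibble)

tb <- tibble(x = 1:3, y = c("a", "b", "c"))

# A tibble inherits from data.frame, so it works anywhere
# a regular data frame is expected
class(tb)
#> [1] "tbl_df"     "tbl"        "data.frame"

is.data.frame(tb)
#> [1] TRUE
```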

8.2.1 Creating tibbles

There are several ways to create tibbles depending on the source of the data.

  • From individual vectors

The simplest way is to build a tibble directly from vectors using tibble(). Inputs of length 1 are automatically recycled, and you can refer to variables you just created:

Code
tibble(
  x = 1:5,
  y = 1,
  z = x ^ 2 + y
)
  • Converting existing objects

You can convert existing data structures into tibbles with as_tibble():

Code
# From a data frame
head(as_tibble(iris))
Code
# From a list
as_tibble(list(x = 1:3, y = letters[1:3]))
  • Reading from external files

Packages in the tidyverse ecosystem return tibbles when reading data from files:

Code
# From CSV, TSV, or delimited text file
head(readr::read_csv("data/Chetty_2014.csv"))
Code
# From Excel files
# readxl::read_excel("data.xlsx")
  • Reading from databases

You can also obtain tibbles when working with databases using packages such as DBI and dbplyr:

Code
library(DBI)
library(dplyr)   # provides tbl()
con <- dbConnect(RSQLite::SQLite(), "mydb.sqlite")
tbl(con, "tablename")   # returns a tibble-like object

A tribble is a transposed tibble, designed for small data entry in code. Column headings are defined by formulas that start with ~:

Code
tribble(
  ~x, ~y, ~z,
  #--|--|----
  "a", 2, 3.6,
  "b", 1, 8.5
)

The tibble and readr packages are part of the core tidyverse, so they load automatically with library(tidyverse). Other packages such as readxl and dbplyr belong to the tidyverse ecosystem. They follow the same principles and return tibbles, but you need to load them explicitly.

8.2.2 Features of tibbles

Tibbles make fewer automatic changes than base R data frames:

  • They never change the type of inputs (strings are not converted to factors).
  • They never change variable names.
  • They never create row names.

For example:

Code
tb <- tibble(
  `:)` = "smile",
  ` ` = "space",
  `2000` = "number"
)

These column names would not be valid in base R, but are allowed in a tibble.

There are two main differences between tibbles and base R data frames:

Printing: Tibbles have a refined print method that shows only the first 10 rows and only as many columns as fit on the screen:

Code
tibble(
  a = lubridate::now() + runif(1e3) * 86400,
  b = lubridate::today() + runif(1e3) * 30,
  c = 1:1e3,
  d = runif(1e3),
  e = sample(letters, 1e3, replace = TRUE)
) |> head()

This design avoids overwhelming the console when printing large data frames.

If you need more output, you can adjust options:

  • print(n = ..., width = ...) controls the number of rows and columns for a single print call.
  • Global options can be set with:
options(tibble.print_max = n, tibble.print_min = m)  # if more than n rows, print only m
options(tibble.print_min = Inf)     # always show all rows
options(tibble.width = Inf)         # always print all columns

Subsetting: Most of the subsetting tools we have used so far generally subset the entire data frame. To pull out just a single variable or value, we can use $ and [[:

  • [[ extracts by name or position
  • $ extracts by name with less typing

Code
df <- tibble(
  x = runif(5),
  y = rnorm(5)
)

# Extract by name
df$x
[1] 0.1749291 0.5200598 0.2591938 0.3329369 0.6937035
Code
df[["x"]]
[1] 0.1749291 0.5200598 0.2591938 0.3329369 0.6937035
Code
# Extract by position
df[[1]]
[1] 0.1749291 0.5200598 0.2591938 0.3329369 0.6937035
Code
df[[1,1]]
[1] 0.1749291
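One caveat: $ and [[ are awkward to use in the middle of a magrittr pipe, because the data frame has no name at that point. With %>% you can refer to the incoming data with the . placeholder (a minimal sketch):

```r
library(tibble)
library(magrittr)

df <- tibble(
  x = runif(5),
  y = rnorm(5)
)

# With the magrittr pipe, . stands for the piped-in data frame
df %>% .$x
df %>% .[["x"]]
```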

8.3 Tidy up data

Structuring datasets to facilitate analysis is at the core of the principles of tidy data, as described by Hadley Wickham.

8.3.1 Tidy data

Tidy data follows three basic rules:

  • Each variable forms a column
  • Each observation forms a row
  • Each type of observational unit forms a table

When these rules are not followed, the dataset is considered untidy. Common signs of untidy data include:

  • Column headers are values instead of variable names
  • Multiple variables are stored in one column (for example, City_State)
  • Variables are stored in both rows and columns
  • Multiple types of observational units are stored in the same table
  • A single observational unit is stored in multiple tables
  • The dataset is either too long or too wide
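As a small illustration of the second symptom, here is a hypothetical table (the cities and figures are made up for the example) where a City_State column stores two variables at once:

```r
library(tibble)

# Hypothetical untidy data: city and state are two variables,
# but they are jammed into a single City_State column
untidy <- tribble(
  ~City_State,     ~population,
  "Boston_MA",     675000,
  "Providence_RI", 190000
)
untidy
```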

8.3.2 Pivoting

Most data encountered in practice will be untidy. This is because most people are not familiar with the principles of tidy data, and data is often organised to facilitate uses other than analysis, such as making entry easier.

Two common problems to look for are:

  • One variable might be spread across multiple columns
  • One observation might be scattered across multiple rows

Usually, a dataset will only suffer from one of these problems.

To resolve them, the tidyr package provides two key functions:

  • pivot_longer()
  • pivot_wider()

These functions are illustrated with example datasets included in the tidyr package. The tables (table2, table4a) contain data on the number of tuberculosis (TB) cases recorded in different countries for the years 1999 and 2000. The variable cases represents the number of TB cases reported for a given country, year, and type of measure.

8.3.2.1 Pivot longer

A common problem is a dataset where some of the column names are not variable names, but values of a variable:

Code
table4a

To tidy a dataset like this, pivot the offending columns into a new pair of variables.

Steps:

  • Select the columns whose names are values, not variables. In this example, those are 1999 and 2000.
  • Choose the variable to move the column names to (here, year).
  • Choose the variable to move the column values to (here, cases).
Code
table4a |> 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

In the final result, the pivoted columns are dropped, and new year and cases columns are created. Other variables, such as country, are preserved. The cases column now explicitly records the number of TB cases for each year and country.
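One detail worth noting: because column names are text, the new year column comes out as a character vector. A minimal sketch of converting it during the pivot, using the names_transform argument of pivot_longer():

```r
library(tidyr)

# Convert the year column to integer while pivoting
table4a |>
  pivot_longer(
    c(`1999`, `2000`),
    names_to = "year",
    values_to = "cases",
    names_transform = list(year = as.integer)
  )
```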

8.3.3 Pivot wider

Function pivot_wider() is the opposite of pivot_longer(). You use it when an observation is scattered across multiple rows.

Code
table2

To tidy this up, analyze the representation in a similar way to pivot_longer():

  • The column to take variable names from (here, type).
  • The column to take values from (here, count).
Code
table2 |>
  pivot_wider(names_from = type, values_from = count)

In this result, values of type (cases and population) become separate columns, and their associated numbers from count fill in the values. This produces a clearer dataset where each row corresponds to a country and year with distinct variables for cases and population.
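A payoff of this wide form is that related variables now sit in the same row, so derived quantities are simple to compute. A minimal sketch (the rate variable and the per-10,000 scaling are our own choices here):

```r
library(tidyr)
library(dplyr)

table2 |>
  pivot_wider(names_from = type, values_from = count) |>
  mutate(rate = cases / population * 10000)   # TB cases per 10,000 people
```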

8.3.4 Separating

The separate() function is used to pull apart one column into multiple columns by splitting wherever a separator character appears. This is useful when a single column actually contains more than one variable.

Consider the dataset table3 included in the tidyr package:

Code
table3

Notice the rate column. It contains two variables combined into a single column: the number of cases and the population size, separated by a forward slash. To make the dataset tidy, these should be split into separate variables.

The separate() function takes the name of the column to split and the names of the new columns to create:

Code
table3 %>%
  separate(rate, into = c("cases", "population"))

This produces two new columns, cases and population, replacing the original rate column. The new columns hold the reported tuberculosis cases and the population in each country and year — as character vectors, since separate() leaves the type unchanged unless told otherwise.

By default, separate() splits values wherever it sees a non-alphanumeric character, meaning any character that is not a number or letter. In the example above, it automatically detected and split at the forward slash.

If you want to be explicit, you can specify the character to split on with the sep argument:

Code
table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/")

This ensures that the column is split exactly where expected, giving you clearer control over the separation process.
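If the split pieces are really numbers, as they are here, the convert argument asks separate() to convert the new character columns to an appropriate type (a minimal sketch):

```r
library(tidyr)
library(magrittr)

# convert = TRUE runs type conversion on the new columns,
# so cases and population become integers
table3 %>%
  separate(rate, into = c("cases", "population"),
           sep = "/", convert = TRUE)
```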

8.3.5 Unite

The unite() function is the inverse of separate(). It combines multiple columns into a single column. This can be useful when two or more variables are stored in separate columns but logically belong together.

Consider the dataset table5:

Code
table5

In this table, the year of observation is split into two columns, century and year. To make the dataset easier to work with, we can combine these into a single column.

Code
table5 %>% 
  unite(new, century, year)

By default, unite() places an underscore (_) between the values from different columns. In this case, that would produce values like 19_99.

If we want the numbers to run together without any separator, we can control this with the sep argument:

Code
table5 %>% 
  unite(new, century, year, sep = "")

This produces a single column new with values such as 1999 and 2000, giving a cleaner representation of the year variable.
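By default unite() removes the input columns from the result. If you want to keep century and year alongside the combined column, set remove = FALSE (a minimal sketch):

```r
library(tidyr)
library(magrittr)

# remove = FALSE keeps the original columns next to the new one
table5 %>%
  unite(new, century, year, sep = "", remove = FALSE)
```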

8.4 Relational Data

Many datasets consist of multiple tables that are connected in a meaningful way. Together, such collections are called relational data. Each table stores information about a particular type of observation, and the relationships among these tables allow us to draw richer conclusions.

Why not store everything in one giant table? Because different kinds of observations naturally belong in different tables. For example, aircraft information does not change across flights, and weather data apply to all flights departing at a specific time from the same airport.

Think: Storage Efficiency and Consistency

Keeping separate tables avoids duplication, reduces storage, and prevents inconsistencies when information changes. If the manufacturer name of a plane changes, updating one record in the planes table is enough, rather than updating thousands of flight records.

We use the nycflights13 package as an example again to illustrate. This package includes several tables describing all flights departing New York City in 2013.

Code
library(nycflights13)
flights %>% select(year:day, hour, origin, dest, tailnum, carrier) %>% head()
Code
airlines %>% head()
Code
planes %>% select(tailnum, manufacturer, model, year) %>% head()
Code
airports %>% select(faa, name, lat, lon) %>% head()
Code
weather %>% select(origin, year:day, hour, temp, wind_speed) %>% head()

Each table contains different but related information:

  • flights records each departure, including date, time, origin, destination, tail number, and carrier code.
  • airlines provides the full airline names corresponding to carrier codes.
  • airports gives the name and geographic location for each airport.
  • planes contains aircraft details such as manufacturer and model.
  • weather records hourly weather data for each origin airport.

These tables are linked by shared variables. For example:

  • carrier links flights to airlines.
  • tailnum links flights to planes.
  • origin and dest link flights to airports.
  • year, month, day, hour, and origin link flights to weather.
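Note that the flights–weather link is a compound key: no single variable identifies an hourly weather record, only the combination does. The same uniqueness check used for primary keys works here too (a minimal sketch):

```r
library(nycflights13)
library(dplyr)

# Rows returned here are key combinations occurring more than once;
# an empty result means the combination uniquely identifies
# each weather record
weather %>%
  count(origin, year, month, day, hour) %>%
  filter(n > 1)
```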

A natural question to ask is: how can we combine information across these tables to answer questions such as:

  1. What types of aircraft fly most often from JFK?
  2. Which airlines experience the longest delays?
  3. Are certain weather conditions associated with longer delays?

To answer these questions, we must understand keys and joins.

8.4.1 Keys and Relationships

A key identifies how observations in one table relate to those in another.

  • A primary key uniquely identifies each observation within a table.
  • A foreign key refers to a primary key in another table.

For example, in planes, the variable tailnum serves as a primary key because each aircraft has a unique tail number. In flights, the same variable acts as a foreign key since many flights can share the same aircraft.

We can verify whether a variable is a primary key by checking for uniqueness:

Code
planes %>% count(tailnum) %>% filter(n > 1)

The absence of duplicates confirms that tailnum is a valid primary key. To check for unmatched foreign keys:

Code
flights %>% anti_join(planes, by = "tailnum") %>% nrow()
[1] 52606

The result shows how many flights lack corresponding aircraft information.

Think: Diagnosing Key Issues

It’s common for foreign keys to have missing matches. This may happen when aircraft were retired, renamed, or simply never recorded. Always check unmatched keys before joining tables.

8.4.2 Mutating Joins

Mutating joins combine variables from two tables based on matching key values. They allow us to enrich one dataset with information from another.

There are four types of mutating joins:

  • inner_join(x, y): keeps only rows with matching keys in both tables.
  • left_join(x, y): keeps all rows from x, adding matches from y.
  • right_join(x, y): keeps all rows from y, adding matches from x.
  • full_join(x, y): keeps all rows from both tables.

8.4.2.1 Example: Add Plane Information

Code
flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)

flights2 %>% left_join(planes, by = "tailnum")

This left join keeps every flight and adds aircraft details whenever a matching tailnum exists. Unmatched rows contain missing values.
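One subtlety: both flights2 and planes contain a variable named year (the year of the flight versus the year the aircraft was manufactured). Since year is not part of the join key, left_join() keeps both copies, suffixed .x and .y by default; the suffix argument lets you pick clearer names (a minimal sketch, with suffix values of our own choosing):

```r
library(nycflights13)
library(dplyr)

flights2 <- flights %>%
  select(year:day, hour, origin, dest, tailnum, carrier)

# Rename the duplicated year columns explicitly via suffix
flights2 %>%
  left_join(planes, by = "tailnum",
            suffix = c("_flight", "_plane")) %>%
  select(tailnum, year_flight, year_plane) %>%
  head()
```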

8.4.2.2 Example: Add Airline Names

Code
flights2 %>% left_join(airlines, by = "carrier")

Now each carrier code is replaced with a readable airline name, making the dataset easier to interpret.

Think: Choosing the Right Join

left_join() is most common because it preserves the main dataset. Use inner_join() only when you are sure you want to exclude records that do not match.

8.4.3 Filtering Joins

Filtering joins keep or exclude observations in one table based on whether they have a match in another. They do not add new columns.

  • semi_join(x, y): keeps rows in x that have a match in y.
  • anti_join(x, y): keeps rows in x that do not have a match in y.

8.4.3.1 Example: Flights with Known Planes

Code
flights2 %>% semi_join(planes, by = "tailnum")

This keeps only flights that have aircraft information in planes.

8.4.3.2 Example: Flights with Unknown Planes

Code
flights2 %>% anti_join(planes, by = "tailnum")

These flights are missing aircraft details. anti_join() is especially useful for finding unmatched keys and diagnosing data quality issues. For instance, you might use it to identify which flights are missing weather information or which airports have no corresponding records.

Think: Using semi_join() and anti_join()

Use semi_join() to filter data based on membership, and anti_join() to find mismatches. They are efficient tools for quality control and exploratory checks.

8.4.4 Summary

Relational data represent multiple connected tables describing related entities. Keeping separate tables helps minimize redundancy, reduce errors, and maintain data consistency. Understanding how these tables are linked through keys allows us to integrate information accurately. The tidyverse join functions make it straightforward to enrich, filter, and explore relational datasets such as the nycflights13 example.